The Bioinformatics Resource Center ATAC-seq pipeline provides QC and processed files for use in secondary analysis systems.
This report was written by Doug Barrows and Thomas Carroll at the Rockefeller University Bioinformatics Resource Center with contributions by Elitsa Stoyanova and Hasnahana Chetia from the Heintz laboratory.
Input files required by the pipeline (links contain information about each file format):
Sample data
Reference data
| Input_Files | Name |
|---|---|
| Sample Name | Sample_2 |
| All_FASTQ | 2_S2_R1_001.fastq.gz;2_S2_R2_001.fastq.gz 2_S2_R1_001_copy1.fastq.gz;2_S2_R2_001_copy1.fastq.gz |
| Input_Files | Name |
|---|---|
| Genome | hg38-UCSC |
| Reference Fasta | BSgenome.Hsapiens.UCSC.hg38.fa |
| GTF | TxDb.Hsapiens.UCSC.hg38.knownGene.gtf.gz |
| ID to Symbol map | org.Hs.eg_ID2Symbol.txt |
Quality control files summarized in this report (links navigate to appropriate section of the report):
Processed files generated by the pipeline (links contain information about each file format):
| Output_Files | File_Name |
|---|---|
| Fastp report | Sample_2_pairs1.fq.gz.json |
| Picard AlignmentSummaryMetrics | Sorted_Sample_2.AlignmentSummaryMetrics.txt |
| ATACseqQC summary | Sorted_Sample_2ATACseqQC.json |
| ChIPQC summary | Sample_2.RData |
| BAM file | Sorted_Sample_2.bam |
| bigWig file | Sorted_Sample_2Normalised.bw |
| MACs peak calls | Sample_2_withInput_Sample_2_peaks.narrowPeak |
| Reads/Counts in Peaks | Sorted_Sample_2_subread_Peak_Antibody_ATAC_Consensus_fromMACSisBlacklisted.Counts |
| Reads/Counts in TSSs | Sorted_Sample_2_subread_Peak_TSS_PositionsisBlacklisted.Counts |
Read quality is assessed using the BRC’s Rfastp R package which provides an R interface for the fastp and genocore C libraries.
In contrast to quality control through FastQC, Rfastp will calculate statistics across the complete FQ file instead of a subsample of reads. Additionally Rfastp includes assessment for read pairs instead of disjointed QC for each file of a pair.
Rfastp captures metrics on:
Fastp assesses each read based on a number of quality metrics, and determines whether that read meets a quality filtering threshold. Reads did not pass the quality filter if at least one of the following are true:
The first table shows the quality filtering results while the second table shows specific quality metrics for all of the reads compared to those that passed filtering. The quality scores (i.e. phred score) for each base (e.g. q20 and q30) in the table below reflect the likelihood that the assignment of that base is incorrect.
| Metric | Number_of_reads |
|---|---|
| passed_filter_reads | 52,671,092 |
| failed_filter_reads | 1,364,128 |
| low_quality_reads | 1,255,894 |
| too_many_N_reads | 5,848 |
| too_short_reads | 102,386 |
| Metric | Before_Filtering | After_Filtering |
|---|---|---|
| total_reads | 54,035,220 | 52,671,092 |
| total_bases | 4,106,676,720 | 3,757,697,827 |
| q20_bases | 3,958,677,856 | 3,655,769,311 |
| q20_rate | 0.964 | 0.9729 |
| q30_bases | 3,905,535,209 | 3,614,638,309 |
| q30_rate | 0.951 | 0.9619 |
| read1_mean_length | 76 | 71 |
| read2_mean_length | 76 | 71 |
| gc_content | 0.5102 | 0.5098 |
| Metric | Value |
|---|---|
| Duplication Rate | 0.0798165 |
The insert size represents the length of the original nucleotide fragment (not including adapters) and is estimated based on the overlap of paired-end reads.
For each sequencing cycle (separated by read 1 or read 2 for paired-end), which corresponds to each position in the read, the proportion for each nucleotide (or combination of G and C) across all reads is shown.
For each sequencing cycle (separated by read 1 or read 2 for paired-end), which corresponds to each position in the read, the average phred quality score is shown. The quality score is equal to -10*log10(P), with P being the probability that a base is incorrect. So if the quality score is 30, this means there is a 0.1% chance (P = 0.001) that the base is incorrect.
The most common Kmers (of length 5) are shown below. Hover over each bar to see the exact value for that Kmer.
Picard and Java are included into the pipeline through installation by the BRC’s CondaSysReqs package. CondaSysReqs installs external software requirements in a dedicated Conda environment versions alongside the pipeline version.
The pipeline generates the AlignmentSummaryMetrics Picard quality control software.
Various Picard metrics related to the alignment are shown. To get a description of each metric, hover over the name or look at the table of descriptions below
| Metric | FIRST_OF_PAIR | SECOND_OF_PAIR |
|---|---|---|
| TOTAL_READS | 27,017,610 | 27,017,610 |
| PF_READS | 27,017,610 | 27,017,610 |
| PCT_PF_READS | 1 | 1 |
| PF_NOISE_READS | 0 | 0 |
| PF_READS_ALIGNED | 25,587,038 | 25,345,455 |
| PCT_PF_READS_ALIGNED | 0.947 | 0.9381 |
| PF_HQ_ALIGNED_READS | 23,091,780 | 22,743,989 |
| MEAN_READ_LENGTH | 76 | 76 |
| READS_ALIGNED_IN_PAIRS | 24,882,389 | 24,882,389 |
| PCT_READS_ALIGNED_IN_PAIRS | 0.9725 | 0.9817 |
| PF_READS_IMPROPER_PAIRS | 2,670,939 | 2,429,356 |
| PCT_PF_READS_IMPROPER_PAIRS | 0.1044 | 0.0959 |
| BAD_CYCLES | 0 | 0 |
| STRAND_BALANCE | 0.4985 | 0.5031 |
| PCT_CHIMERAS | 0.041 | 0.0416 |
| PCT_ADAPTER | 0 | 0 |
ATAC-seq specific quality metrics are retrieved using the ATACseqQC R/Bioconductor package.
These provide additional insight into the quality of the ATAC-seq data given expected features of ATAC-seq.
For samples with larger number of reads (> 40M reads) we downsample reads while maintaining paired-end read information to 40 million reads.
The first table provides annotation independent assessment of ATAC-seq quality including the rate of reads mapping in proper pairs (paired reads within defined distance of each other), the rate of reads mapping to the mitochondria, the fraction of non-duplicate reads and the two pcr bottle neck coefficients (PBC1/PBC2). For more information on these metrics, hover over the terms or see the Glossary at the bottom of page.
| Metric | Value |
|---|---|
| Proper_Pair_Rate | 0.8999 |
| Mapped_To_Mitochondria_Rate | 0.0661 |
| Non_Redundant_Fraction | 0.8605 |
| PBC_1 | 0.9425 |
| PBC_2 | 20.1489 |
The below table contains the annotation specific ATAC-seq quality metric, Transcription Start Site (TSS) Enrichment Score or TSSE. This scores assesses the enrichment of signal within a TSS compared to the surrounding area. For more information on TSSE scores, please see the Glossary at the bottom of page.
| Metric | TSSE_Score |
|---|---|
| TSSE | 38.6486 |
The plot and table below show the the insert sizes for all aligned fragments and highlights the nucleosome free, mono-nucleosome and di-nucleosome proportions of the data. The presense of clear bump in the mono-nucleosome fractions indicates a higher quality ATAC-seq dataset.
| ATAC_Fraction | Proportion_Of_Mapped |
|---|---|
| Nucleosome Free | 0.2492 |
| Mono-Nucleosome | 0.1818 |
| Di-Nucleosome | 0.1438 |
For ATAC-seq open regions are detected using MACS2.
For paired-end ATAC-seq data, multimapping and non-properly paired reads are removed using Rsamtools libraries and resulting BAM file used to call MACs peaks with default parameters in BAMPE mode.
In the below table we assess the total number of peaks and the fraction of reads mapping to peaks (FRIP).
| Metric | Value |
|---|---|
| Total number of peaks | 39,304 |
| Fraction of reads in peaks (FRIP) | 0.4755 |
We further assess the distribution of signal across genomic annotation as shown in the below table. As these regions are non-exclusive and are overlapping, the signal in overlapping regions will be counted in both regions and so the sum of table and reads in intergenic regions will not add up to 1.
| Annotation | Metric |
|---|---|
| LongPromoter20000to2000 | 0.1273 |
| Promoters2000to500 | 0.011 |
| Promoters500 | 0.0037 |
| All5utrs | 0.0034 |
| Alltranscripts | 0.4842 |
| Allcds | 0.0117 |
| Allintrons | 0.4628 |
| All3utrs | 0.0141 |
| Package | Citation |
|---|---|
| Picard Tools | https://broadinstitute.github.io/picard/ |
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server 7.5 (Maipo)
##
## Matrix products: default
## BLAS: /rugpfs/fs0/ruit/scratch/tcarroll/R-3.5.1/lib/libRblas.so
## LAPACK: /rugpfs/fs0/ruit/scratch/tcarroll/R-3.5.1/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 rio_0.5.16
## [3] GlossyBRC_0.1.0 plotly_4.8.0
## [5] kableExtra_1.2.1 magrittr_1.5
## [7] forcats_0.3.0 stringr_1.3.1
## [9] dplyr_0.7.6 purrr_0.2.5
## [11] readr_1.3.1 tidyr_0.8.1
## [13] tibble_1.4.2 tidyverse_1.2.1
## [15] cowplot_0.9.3 DT_0.4
## [17] ChIPQC_1.16.2 DiffBind_2.8.0
## [19] SummarizedExperiment_1.10.1 DelayedArray_0.6.6
## [21] BiocParallel_1.14.2 matrixStats_0.54.0
## [23] Biobase_2.40.0 GenomicRanges_1.32.7
## [25] GenomeInfoDb_1.16.0 IRanges_2.14.12
## [27] S4Vectors_0.18.3 BiocGenerics_0.26.0
## [29] ggplot2_3.0.0 jsonlite_1.5
## [31] rjson_0.2.20 rmarkdown_1.10
##
## loaded via a namespace (and not attached):
## [1] sendmailR_1.2-1
## [2] SparseM_1.77
## [3] rtracklayer_1.40.6
## [4] AnnotationForge_1.22.2
## [5] prabclus_2.2-6
## [6] acepack_1.4.1
## [7] bit64_0.9-7
## [8] knitr_1.20
## [9] data.table_1.11.4
## [10] rpart_4.1-13
## [11] hwriter_1.3.2
## [12] RCurl_1.95-4.11
## [13] GenomicFeatures_1.34.4
## [14] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
## [15] RSQLite_2.1.1
## [16] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [17] bit_1.1-14
## [18] webshot_0.5.1
## [19] xml2_1.2.0
## [20] lubridate_1.7.4
## [21] httpuv_1.4.5
## [22] assertthat_0.2.0
## [23] viridis_0.5.1
## [24] amap_0.8-16
## [25] tximport_1.8.0
## [26] hms_0.4.2
## [27] promises_1.0.1
## [28] evaluate_0.11
## [29] TSP_1.1-6
## [30] DEoptimR_1.0-8
## [31] progress_1.2.0
## [32] caTools_1.17.1.1
## [33] dendextend_1.8.0
## [34] readxl_1.1.0
## [35] Rgraphviz_2.24.0
## [36] DBI_1.0.0
## [37] geneplotter_1.58.0
## [38] htmlwidgets_1.2
## [39] ngsPipeR_0.9.431
## [40] crosstalk_1.0.0
## [41] backports_1.1.2
## [42] BatchJobs_1.7
## [43] trimcluster_0.1-2.1
## [44] annotate_1.58.0
## [45] biomaRt_2.36.1
## [46] prettydoc_0.2.1
## [47] Rfastp_0.1.2
## [48] withr_2.1.2
## [49] BSgenome_1.48.0
## [50] robustbase_0.93-2
## [51] checkmate_1.8.5
## [52] GenomicAlignments_1.16.0
## [53] gclus_1.3.1
## [54] prettyunits_1.0.2
## [55] mclust_5.4.1
## [56] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
## [57] cluster_2.0.7-1
## [58] lazyeval_0.2.1
## [59] crayon_1.3.4
## [60] genefilter_1.62.0
## [61] edgeR_3.22.3
## [62] pkgconfig_2.0.2
## [63] labeling_0.3
## [64] nlme_3.1-137
## [65] seriation_1.2-3
## [66] nnet_7.3-12
## [67] bindr_0.1.1
## [68] rlang_0.2.2
## [69] diptest_0.75-7
## [70] registry_0.5
## [71] GOstats_2.46.0
## [72] modelr_0.1.2
## [73] cellranger_1.1.0
## [74] rprojroot_1.3-2
## [75] graph_1.58.0
## [76] Matrix_1.2-14
## [77] chipseq_1.30.0
## [78] Rsubread_1.30.6
## [79] base64enc_0.1-3
## [80] whisker_0.3-2
## [81] pheatmap_1.0.10
## [82] viridisLite_0.3.0
## [83] BSgenome.Mmusculus.UCSC.mm10_1.4.0
## [84] bitops_1.0-6
## [85] KernSmooth_2.23-15
## [86] Biostrings_2.48.0
## [87] blob_1.1.1
## [88] ShortRead_1.38.0
## [89] brew_1.0-6
## [90] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
## [91] scales_1.0.0
## [92] memoise_1.1.0
## [93] GSEABase_1.42.0
## [94] plyr_1.8.4
## [95] gplots_3.0.1
## [96] gdata_2.18.0
## [97] zlibbioc_1.26.0
## [98] compiler_3.5.1
## [99] RColorBrewer_1.1-2
## [100] DESeq2_1.20.0
## [101] Rsamtools_1.32.3
## [102] cli_1.0.0
## [103] systemPipeR_1.14.0
## [104] XVector_0.20.0
## [105] Category_2.46.0
## [106] htmlTable_1.12
## [107] Formula_1.2-3
## [108] MASS_7.3-50
## [109] tidyselect_0.2.4
## [110] widgetframe_0.3.1
## [111] stringi_1.2.4
## [112] yaml_2.2.0
## [113] locfit_1.5-9.1
## [114] latticeExtra_0.6-28
## [115] ggrepel_0.8.0
## [116] grid_3.5.1
## [117] tools_3.5.1
## [118] rstudioapi_0.7
## [119] foreach_1.4.4
## [120] foreign_0.8-71
## [121] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [122] DEXSeq_1.26.0
## [123] gridExtra_2.3
## [124] digest_0.6.17
## [125] shiny_1.1.0
## [126] fpc_2.1-11.1
## [127] Rcpp_1.0.4.6
## [128] broom_0.5.0
## [129] later_0.7.4
## [130] org.Hs.eg.db_3.6.0
## [131] httr_1.4.1
## [132] Nozzle.R1_1.1-1
## [133] AnnotationDbi_1.44.0
## [134] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2
## [135] kernlab_0.9-27
## [136] colorspace_1.3-2
## [137] rvest_0.3.2
## [138] XML_3.98-1.16
## [139] topGO_2.32.0
## [140] splines_3.5.1
## [141] RBGL_1.56.0
## [142] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
## [143] statmod_1.4.30
## [144] flexmix_2.3-14
## [145] xtable_1.8-3
## [146] BBmisc_1.11
## [147] heatmaply_0.15.2
## [148] modeltools_0.2-22
## [149] R6_2.2.2
## [150] Hmisc_4.1-1
## [151] mime_0.5
## [152] pillar_1.3.0
## [153] htmltools_0.3.6
## [154] glue_1.3.0
## [155] class_7.3-14
## [156] codetools_0.2-15
## [157] mvtnorm_1.1-0
## [158] lattice_0.20-35
## [159] curl_3.2
## [160] gtools_3.8.1
## [161] zip_2.0.4
## [162] GO.db_3.6.0
## [163] openxlsx_4.1.0.1
## [164] survival_2.42-6
## [165] limma_3.36.3
## [166] munsell_0.5.0
## [167] GenomeInfoDbData_1.1.0
## [168] iterators_1.0.10
## [169] haven_2.0.0
## [170] reshape2_1.4.3
## [171] gtable_0.2.0
## [172] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2